$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$ $\newcommand{\amax}{{\text{argmax}}}$ $\newcommand{\P}{{\mathbb{P}}}$ $\newcommand{\E}{{\mathbb{E}}}$ $\newcommand{\R}{{\mathbb{R}}}$ $\newcommand{\Z}{{\mathbb{Z}}}$ $\newcommand{\N}{{\mathbb{N}}}$ $\newcommand{\C}{{\mathbb{C}}}$ $\newcommand{\abs}[1]{{ \left| #1 \right| }}$ $\newcommand{\simpl}[1]{{\Delta^{#1} }}$

Anomaly Detection via Reconstruction Error

Snow

import numpy as np
import itertools as it
from tqdm import tqdm

import matplotlib
from matplotlib import pyplot as plt
import plotly.express as px
import pandas as pd

import ipywidgets as widgets

from tfl_training_anomaly_detection.exercise_tools import evaluate, get_kdd_data, get_house_prices_data, create_distributions, contamination, \
perform_rkde_experiment, get_mnist_data

from ipywidgets import interact

from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import IsolationForest
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

from tfl_training_anomaly_detection.vae import VAE, build_decoder_mnist, build_encoder_minst, build_contaminated_minst

from tensorflow import keras

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (5, 5)

Anomaly Detection via Reconstruction Error

Idea: Embed the data into a low dimensional space and reconstruct it again. If the model embeds nominal data well, a high reconstruction error indicates an anomaly.

Autoencoder:

  • Parametric family of encoders: $f_\phi: \mathbb{R}^d \to \mathbb{R}^{\text{low}}$
  • Parametric family of decoders: $g_\theta: \mathbb{R}^{\text{low}} \to \mathbb{R}^{d}$
  • Reconstruction error of $(f_\phi, g_\theta)$ on $x$: $\|x - g_\theta(f_\phi(x))\|$
  • Given a data set $D$, find $\phi,\theta$ that minimize $\sum_{x\in D} L(\|x - g_\theta(f_\phi(x))\|)$ for some loss function $L$.
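As a minimal illustration of this idea (a sketch only, using scikit-learn's PCA as a linear encoder/decoder pair rather than a neural autoencoder): points that lie off the low dimensional subspace of the nominal data receive a much larger reconstruction error.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Nominal data lies near a 2D subspace of R^10, plus a little noise
nominal = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
nominal += 0.01 * rng.normal(size=nominal.shape)
anomaly = rng.normal(size=(5, 10))  # isotropic points off that subspace

# A linear "encoder" (transform) and "decoder" (inverse_transform) pair
pca = PCA(n_components=2).fit(nominal)

def reconstruction_error(X):
    # ||x - g(f(x))|| per sample
    return np.linalg.norm(X - pca.inverse_transform(pca.transform(X)), axis=1)

print(reconstruction_error(nominal).mean())  # small: noise level only
print(reconstruction_error(anomaly).mean())  # large: off the subspace
```

Replacing PCA with nonlinear encoder/decoder networks gives the autoencoder the same scoring principle on much more complex data manifolds.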

Visualization

Autoencoder Schema

Neural Networks

Neural networks are very well suited for finding low dimensional representations of data. Hence they are a popular choice for the encoder and the decoder.

Artificial neuron with $N$ inputs: $y = \sigma\left(\sum_{i=1}^N w_i x_i + b\right)$

  • $\sigma$: nonlinear activation function
  • $w_i$: weights
  • $b$: bias
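A single artificial neuron can be sketched in a few lines of numpy (illustration only; the weights, bias, and tanh activation below are arbitrary choices):

```python
import numpy as np

def neuron(x, w, b, sigma=np.tanh):
    # y = sigma(sum_i w_i * x_i + b)
    return sigma(w @ x + b)

x = np.array([1.0, -2.0, 0.5])   # inputs
w = np.array([0.3, 0.1, -0.4])   # weights
y = neuron(x, w, b=0.2)          # tanh(0.3 - 0.2 - 0.2 + 0.2) = tanh(0.1)
```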

Neural Networks

Neural networks combine many artificial neurons into a complex network. These networks are usually organized in layers, where the output of each layer serves as the input to the next. Commonly used layer types include fully connected, convolutional, and attention layers.

Variational Autoencoders

An important extension of autoencoders that relates the idea to density estimation. More precisely, we define a generative model for our data using latent variables and combine the maximum likelihood estimation of the parameters with a simultaneous posterior estimation of the latents through amortized stochastic variational inference. We use a decoder network to transform the latent variables into the data distribution, and an encoder network to compute the posterior distribution of the latents given the data.

Definition:

The model uses an observed variable $X$ (the data) and a latent variable $Z$ (the defining features of $X$). We assume both $P(Z)$ and $P(X\mid Z)$ to be normally distributed. More precisely

  • $P(Z) = \mathcal{N}(0, I)$
  • $P(X\mid Z) = \mathcal{N}(\mu_\phi(Z), I)$

where $\mu_\phi$ is a neural network parametrized by $\phi$. We use variational inference to perform posterior inference on $Z$ given $X$. We assume the distribution $P(Z\mid X)$ is relatively well approximated by a Gaussian and use the posterior approximation:

  • $q(Z\mid X) = \mathcal{N}(\mu_\psi(X), \sigma_\psi(X))$

where $\mu_\psi$ and $\sigma_\psi$ are neural networks parameterized by $\psi$.
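Sampling from such a Gaussian posterior approximation is typically implemented with the reparameterization trick, $z = \mu + \sigma \cdot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, which keeps the sample differentiable with respect to the encoder outputs. A numpy sketch (the mu and log_var values are hypothetical encoder outputs, not taken from the course code):

```python
import numpy as np

def sample_z(mu, log_var, rng):
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    # Gradients can flow through mu and log_var because eps is
    # sampled independently of the parameters.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0, 0.0])      # hypothetical encoder mean for one x
log_var = np.array([0.0, 0.0, 0.0])  # sigma = 1 in every dimension
samples = np.stack([sample_z(mu, log_var, rng) for _ in range(10_000)])
```

Averaging many samples recovers mu, confirming the samples follow $\mathcal{N}(\mu, \sigma^2 I)$.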

Given a data set $D$ we minimize the (amortized) Kullback-Leibler divergence between our posterior approximation and the true posterior: \begin{align*} D_{KL}(q(z\mid x),p(z\mid x)) &= E_{x\sim X, z\sim q(Z\mid x)}\left[\log\left(\frac{q(z \mid x)}{p(z \mid x)}\right)\right] \\ &= E_{x\sim X, z\sim q(Z\mid x)}\left[\log\left(\frac{q(z \mid x)}{\frac{p(x \mid z)p(z)}{p(x)}}\right)\right] \\ &= E_{x\sim X, z\sim q(Z\mid x)}\left[\log\left(\frac{q(z \mid x)}{p(x \mid z)p(z)}\right) + \log(p(x))\right] \\ &= E_{x\sim X, z\sim q(Z\mid x)}\left[\log\left(\frac{q(z \mid x)}{p(x \mid z)p(z)}\right)\right] + E_{x\sim X}[\log(p(x))] \end{align*}

Now we can define

\begin{align*} \mathrm{ELBO}(q(z\mid x),p(z\mid x)) &:= E_{x\sim X}[\log(p(x))] - D_{KL}(q(z\mid x),p(z\mid x)) \\ &= -E_{x\sim X, z\sim q(Z\mid x)}\left[\log\left(\frac{q(z \mid x)}{p(x \mid z)p(z)}\right)\right] \end{align*}

Note that we can evaluate the expression inside the expectation of the final right-hand side, and we can obtain unbiased estimates of the expectation via sampling. Let us further try to understand the ELBO as an optimization objective. On the one hand, maximizing the ELBO with respect to the parameters of $q$ is equivalent to minimizing the KL divergence between $q$ and $p$, since $E_{x\sim X}[\log(p(x))]$ does not depend on $q$. On the other hand, maximizing the ELBO with respect to the parameters of $p$ raises a lower bound on the likelihood of the generative model $p(x)$. Hence, the optimization tries to find an encoder-decoder pair that simultaneously provides a good generative explanation of the data and a good approximation of the posterior distribution of the latent variables.
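For the Gaussian choices above, the KL term between the approximate posterior and the standard normal prior even has a closed form, $D_{KL}(\mathcal{N}(\mu, \sigma^2 I) \,\|\, \mathcal{N}(0, I)) = \frac{1}{2}\sum_i (\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2)$, so only the reconstruction term needs sampling. A numpy sketch of this formula (an illustration, not the course's implementation):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )
    # = 0.5 * sum_i (mu_i^2 + sigma_i^2 - 1 - log sigma_i^2)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

# The KL vanishes exactly when the posterior equals the prior...
print(kl_to_standard_normal(np.zeros(3), np.zeros(3)))  # -> 0.0
# ...and grows as the posterior moves away from it
print(kl_to_standard_normal(np.ones(3), np.zeros(3)))   # -> 1.5
```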

Exercise

The MNIST Data Set

MNIST is one of the most iconic data sets in the history of machine learning. It contains 70,000 grayscale images of handwritten digits, each $28\times 28$ pixels. Because of its moderate complexity and easy visualization, it is well suited for studying the behavior of machine learning algorithms in higher dimensional spaces.

While originally created for classification (optical character recognition), we can build an anomaly detection data set by corrupting some of the images.

Pre-processing

We first need to obtain the MNIST data set and prepare an anomaly detection set from it. Note that the data set is in row vector format: each image is flattened, so we work with $28\times 28 = 784$ dimensional data points.

Load MNIST Data Set

mnist = get_mnist_data()

data = mnist['data']
print('data.shape: {}'.format(data.shape))
target = mnist['target'].astype(int)
data.shape: (70000, 784)

Build contaminated Data Sets

We prepared a function that does the job for us. It corrupts a prescribed portion of the data by introducing a rotation, noise, or a blackout of part of the image.
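For illustration, the three corruption types could be sketched as follows (a hypothetical corrupt helper with arbitrary parameters; the actual build_contaminated_minst implementation may differ):

```python
import numpy as np

def corrupt(img, kind, rng):
    """Apply one of three illustrative corruptions to a (28, 28) image in [0, 1].

    The band position and noise scale are arbitrary choices for this sketch.
    """
    if kind == 'rotation':
        return np.rot90(img)  # rotate by 90 degrees
    if kind == 'noise':
        return np.clip(img + rng.normal(scale=0.5, size=img.shape), 0.0, 1.0)
    if kind == 'blackout':
        out = img.copy()
        out[10:18, :] = 0.0   # black out a horizontal band
        return out
    raise ValueError(kind)

rng = np.random.default_rng(0)
img = rng.random((28, 28))
corrupted = {k: corrupt(img, k, rng) for k in ('rotation', 'noise', 'blackout')}
```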

First, we need to transform the data into image format.

X = data.reshape(-1, 28, 28, 1)/255

Train/Test-Split

We will only corrupt the test set, so we perform the train/test split beforehand. We separate a relatively small test set so that as much of the data as possible remains available for learning high quality representations.

test_size = .1
X_train, X_test, target_train, target_test = train_test_split(X, target, test_size=test_size)
X_test, y_test = build_contaminated_minst(X_test)

# Visualize contamination
anomalies = X_test[y_test != 0]
selection = np.random.choice(len(anomalies), 25)

fig, axes = plt.subplots(nrows=5, ncols=5, figsize=(5, 5))
for img, ax in zip(anomalies[selection], axes.flatten()):
    ax.imshow(img, 'gray')
    ax.axis('off')
plt.show()

Autoencoder

Let us finally train a variational autoencoder. We replicate the model given in the Keras documentation and apply it in a synthetic outlier detection scenario based on MNIST.

In the vae package we provide the implementation of the VAE. Please take a look at the source code to see how the minimization of the KL divergence is implemented.

Create Model

latent_dim = 3
vae = VAE(decoder=build_decoder_mnist(latent_dim=latent_dim), encoder=build_encoder_minst(latent_dim=latent_dim))
## Inspect model architecture
vae.encoder.summary()
Model: "encoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_2 (InputLayer)            [(None, 28, 28, 1)]  0                                            
__________________________________________________________________________________________________
conv2d (Conv2D)                 (None, 14, 14, 32)   320         input_2[0][0]                    
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 7, 7, 64)     18496       conv2d[0][0]                     
__________________________________________________________________________________________________
flatten (Flatten)               (None, 3136)         0           conv2d_1[0][0]                   
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 16)           50192       flatten[0][0]                    
__________________________________________________________________________________________________
z_mean (Dense)                  (None, 3)            51          dense_1[0][0]                    
__________________________________________________________________________________________________
z_log_var (Dense)               (None, 3)            51          dense_1[0][0]                    
__________________________________________________________________________________________________
sampling (Sampling)             (None, 3)            0           z_mean[0][0]                     
                                                                 z_log_var[0][0]                  
==================================================================================================
Total params: 69,110
Trainable params: 69,110
Non-trainable params: 0
__________________________________________________________________________________________________
## Inspect model architecture
vae.decoder.summary()
Model: "decoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 3)]               0         
_________________________________________________________________
dense (Dense)                (None, 3136)              12544     
_________________________________________________________________
reshape (Reshape)            (None, 7, 7, 64)          0         
_________________________________________________________________
conv2d_transpose (Conv2DTran (None, 14, 14, 64)        36928     
_________________________________________________________________
conv2d_transpose_1 (Conv2DTr (None, 28, 28, 32)        18464     
_________________________________________________________________
conv2d_transpose_2 (Conv2DTr (None, 28, 28, 1)         289       
=================================================================
Total params: 68,225
Trainable params: 68,225
Non-trainable params: 0
_________________________________________________________________
# train model
n_epochs = 30

vae.compile(optimizer=keras.optimizers.Adam(learning_rate=.001))
history = vae.fit(X_train, epochs=n_epochs, batch_size=128)
Epoch 1/30
493/493 [==============================] - 35s 70ms/step - loss: 42.5097 - reconstruction_loss: 34.5901 - kl_loss: 0.6970
Epoch 2/30
493/493 [==============================] - 34s 69ms/step - loss: 31.0599 - reconstruction_loss: 28.5956 - kl_loss: 2.1009
Epoch 3/30
493/493 [==============================] - 34s 69ms/step - loss: 30.1241 - reconstruction_loss: 27.5431 - kl_loss: 2.4653
Epoch 4/30
493/493 [==============================] - 34s 68ms/step - loss: 29.7932 - reconstruction_loss: 27.0414 - kl_loss: 2.6726
Epoch 5/30
493/493 [==============================] - 35s 71ms/step - loss: 29.5701 - reconstruction_loss: 26.7072 - kl_loss: 2.8093
Epoch 6/30
493/493 [==============================] - 34s 68ms/step - loss: 29.4335 - reconstruction_loss: 26.5000 - kl_loss: 2.8876
Epoch 7/30
493/493 [==============================] - 37s 75ms/step - loss: 29.3880 - reconstruction_loss: 26.3282 - kl_loss: 2.9572
Epoch 8/30
493/493 [==============================] - 34s 68ms/step - loss: 29.2236 - reconstruction_loss: 26.2072 - kl_loss: 3.0092
Epoch 9/30
493/493 [==============================] - 33s 67ms/step - loss: 29.1663 - reconstruction_loss: 26.0906 - kl_loss: 3.0563
Epoch 10/30
493/493 [==============================] - 33s 67ms/step - loss: 29.0762 - reconstruction_loss: 25.9887 - kl_loss: 3.0914
Epoch 11/30
493/493 [==============================] - 33s 67ms/step - loss: 29.0782 - reconstruction_loss: 25.8939 - kl_loss: 3.1354
Epoch 12/30
493/493 [==============================] - 33s 68ms/step - loss: 28.9933 - reconstruction_loss: 25.7886 - kl_loss: 3.1714
Epoch 13/30
493/493 [==============================] - 34s 68ms/step - loss: 29.0340 - reconstruction_loss: 25.7205 - kl_loss: 3.2132
Epoch 14/30
493/493 [==============================] - 34s 69ms/step - loss: 28.9053 - reconstruction_loss: 25.6475 - kl_loss: 3.2307
Epoch 15/30
493/493 [==============================] - 33s 68ms/step - loss: 28.8405 - reconstruction_loss: 25.5563 - kl_loss: 3.2630
Epoch 16/30
493/493 [==============================] - 34s 68ms/step - loss: 28.8373 - reconstruction_loss: 25.5013 - kl_loss: 3.2902
Epoch 17/30
493/493 [==============================] - 34s 69ms/step - loss: 28.8136 - reconstruction_loss: 25.4411 - kl_loss: 3.3181
Epoch 18/30
493/493 [==============================] - 35s 71ms/step - loss: 28.7450 - reconstruction_loss: 25.3897 - kl_loss: 3.3306
Epoch 19/30
493/493 [==============================] - 34s 69ms/step - loss: 28.8071 - reconstruction_loss: 25.3357 - kl_loss: 3.3655
Epoch 20/30
493/493 [==============================] - 33s 67ms/step - loss: 28.6384 - reconstruction_loss: 25.2765 - kl_loss: 3.3830
Epoch 21/30
493/493 [==============================] - 33s 67ms/step - loss: 28.6931 - reconstruction_loss: 25.2556 - kl_loss: 3.3926
Epoch 22/30
493/493 [==============================] - 33s 67ms/step - loss: 28.6249 - reconstruction_loss: 25.2039 - kl_loss: 3.4114
Epoch 23/30
493/493 [==============================] - 33s 67ms/step - loss: 28.6697 - reconstruction_loss: 25.1551 - kl_loss: 3.4301
Epoch 24/30
493/493 [==============================] - 33s 67ms/step - loss: 28.6615 - reconstruction_loss: 25.1161 - kl_loss: 3.4372
Epoch 25/30
493/493 [==============================] - 33s 67ms/step - loss: 28.6019 - reconstruction_loss: 25.0742 - kl_loss: 3.4602
Epoch 26/30
493/493 [==============================] - 33s 67ms/step - loss: 28.5390 - reconstruction_loss: 25.0427 - kl_loss: 3.4624
Epoch 27/30
493/493 [==============================] - 34s 68ms/step - loss: 28.5158 - reconstruction_loss: 25.0139 - kl_loss: 3.4798
Epoch 28/30
493/493 [==============================] - 33s 68ms/step - loss: 28.5666 - reconstruction_loss: 24.9772 - kl_loss: 3.4934
Epoch 29/30
493/493 [==============================] - 34s 68ms/step - loss: 28.5249 - reconstruction_loss: 24.9418 - kl_loss: 3.5118
Epoch 30/30
493/493 [==============================] - 33s 67ms/step - loss: 28.4691 - reconstruction_loss: 24.8969 - kl_loss: 3.5315

Inspect Result

import matplotlib.pyplot as plt


def plot_latent_space(vae: VAE, n: int=10, figsize: float=10):
    """Plot sample images from 2D slices of latent space
    
    @param vae: vae model
    @param n: sample nXn images per slice
    @param figsize: figure size
    
    """
    for perm in [[0, 1, 2], [1, 2, 0], [2, 1, 0]]:
        # display a n*n 2D manifold of digits
        digit_size = 28
        scale = 1.0
        figure = np.zeros((digit_size * n, digit_size * n))
        # linearly spaced coordinates corresponding to the 2D plot
        # of digit classes in the latent space
        grid_x = np.linspace(-scale, scale, n)
        grid_y = np.linspace(-scale, scale, n)[::-1]

        for i, yi in enumerate(grid_y):
            for j, xi in enumerate(grid_x):
                z_sample = np.array([[xi, yi, 0]])
                z_sample[0] = z_sample[0][perm]
                x_decoded = vae.decoder.predict(z_sample)
                digit = x_decoded[0].reshape(digit_size, digit_size)
                figure[
                    i * digit_size : (i + 1) * digit_size,
                    j * digit_size : (j + 1) * digit_size,
                ] = digit

        plt.figure(figsize=(figsize, figsize))
        start_range = digit_size // 2
        end_range = n * digit_size + start_range
        pixel_range = np.arange(start_range, end_range, digit_size)
        sample_range_x = np.round(grid_x, 1)
        sample_range_y = np.round(grid_y, 1)
        plt.xticks(pixel_range, sample_range_x)
        plt.yticks(pixel_range, sample_range_y)
        plt.xlabel("z[{}]".format(perm[0]))
        plt.ylabel("z[{}]".format(perm[1]))
        plt.gca().set_title('z[{}] = 0'.format(perm[2]))
        plt.imshow(figure, cmap="Greys_r")
        plt.show()
plot_latent_space(vae)
# Principal components
pca = PCA()
latents = vae.encoder.predict(X_train)[2]
pca.fit(latents)

kwargs = {'x_{}'.format(i): (-1., 1.) for i in range(latent_dim)}


@widgets.interact(**kwargs)
def explore_latent_space(**kwargs):
    """Widget to explore latent space from given start position
    """
    center_img = pca.transform(np.zeros([1,latent_dim]))

    latent_rep_pca =  center_img + np.array([[kwargs[key] for key in kwargs]])
    latent_rep = pca.inverse_transform(latent_rep_pca)
    img = vae.decoder(latent_rep).numpy().reshape(28, 28)

    fig, ax = plt.subplots()
    ax.axis('off')

    ax.imshow(img,cmap='gray', vmin=0, vmax=1)
    plt.show()
latents = vae.encoder.predict(X_train)[2]
scatter = px.scatter_3d(x=latents[:, 0], y=latents[:, 1], z=latents[:, 2], color=target_train)

scatter.show()
latents = vae.encoder.predict(X_test)[2]
scatter = px.scatter_3d(x=latents[:, 0], y=latents[:, 1], z=latents[:, 2], color=y_test)

scatter.show()
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test)
n_samples = 10

idx = np.random.choice(len(X_val), n_samples)
s = X_val[idx]

fig, axes = plt.subplots(nrows=2, ncols=n_samples, figsize=(10, 2))
for img, ax_row in zip(s, axes.T):
    x = vae.decoder.predict(vae.encoder.predict(img.reshape(1, 28, 28, 1))[2]).reshape(28, 28)
    diff = x - img.reshape(28, 28)
    error = (diff * diff).sum()
    ax_row[0].axis('off')
    ax_row[1].axis('off')
    ax_row[0].imshow(img,cmap='gray', vmin=0, vmax=1)
    ax_row[1].imshow(x, cmap='gray', vmin=0, vmax=1)
    ax_row[1].set_title('E={:.1f}'.format(error))

plt.tight_layout()
plt.show()
from sklearn import metrics
y_test_bin = y_test.copy()
y_test_bin[y_test != 0] = 1
y_val_bin = y_val.copy()
y_val_bin[y_val != 0] = 1
# Evaluate
reconstruction = vae.decoder.predict(vae.encoder(X_val)[2])
rerrors = (reconstruction - X_val).reshape(-1, 28*28)
rerrors = (rerrors * rerrors).sum(axis=1)

# Let's calculate scores if any anomaly is present
if np.any(y_val_bin == 1):
    results = evaluate(y_val_bin.astype(int), rerrors.astype(float))
    pr, rec, thr = results['PR']
    f1s = 2 * (pr * rec)[:-1] / (pr + rec)[:-1]
    threshold = thr[np.argmax(f1s)]
    print('Optimal threshold: {}'.format(threshold))

    reconstruction = vae.decoder.predict(vae.encoder(X_test)[2])
    reconstruction_error = (reconstruction - X_test).reshape(-1, 28*28)
    reconstruction_error = (reconstruction_error * reconstruction_error).sum(axis=1)


    classification = (reconstruction_error > threshold).astype(int)

    print('Precision: {}'.format(metrics.precision_score(y_test_bin, classification)))
    print('Recall: {}'.format(metrics.recall_score(y_test_bin, classification)))
    print('F1: {}'.format(metrics.f1_score(y_test_bin, classification)))

    metrics.confusion_matrix(y_test_bin, classification)
else:
    reconstruction_error = None
Optimal threshold: 88.6082652988337
Precision: 0.6266666666666667
Recall: 0.44339622641509435
F1: 0.5193370165745856

Sort Data by Reconstruction Error

if reconstruction_error is not None:
    combined = list(zip(X_test, reconstruction_error))
    combined.sort(key = lambda x: x[1])

Show Top Autoencoder Outliers

if reconstruction_error is not None:
    n_rows = 10
    n_cols = 10
    n_samples = n_rows*n_cols

    samples = [c[0] for c in combined[-n_samples:]]

    fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(2*n_cols, 2*n_rows))
    for img, ax in zip(samples, axes.reshape(-1)):
        ax.axis('off')
        ax.imshow(img.reshape((28,28)), cmap='gray', vmin=0, vmax=1)

    plt.show()

Summary

  • Autoencoders are the most prominent reconstruction error based anomaly detection method.
  • Can provide high quality results on high dimensional data.
  • Architecture is highly adaptable to the data (fully connected, CNN, attention,...).
  • Sensitive to contamination.
  • Variational autoencoders are an important variant that improves the interpretability of the latent space.

Implementations
